Understanding European Football

In this study, we analyze football (soccer in the US) events using a dataset of 941,009 events collected from 9,074 European games. The dataset is open source and can be downloaded from https://www.kaggle.com/secareanualin/football-events/data. Each event is described by 22 features, including the type of event (11 varieties, including foul, corner, offside, and red card), the minute in which the event happens, the side of the team involved (home or away), the location of the event on the pitch (e.g., defensive half, attacking half, penalty spot), and the game situation in which the event happens (e.g., open play, corner, free kick). If the event is a shot, the dataset also provides the shot placement (e.g., bottom left corner, top left corner, center of the goal), the shot outcome (e.g., on target, off target, blocked), the body part used for the shot (i.e., left foot, right foot, head), the method used to assist the shooter (e.g., pass, cross, headed pass), and a boolean variable that indicates whether the shot ends up as a goal. Supplementary data include the team that generates the event and the players involved.

Data features and transformation

This study focuses on 11 of the features available in the dataset, namely: event type, time, side, location, situation, shot placement, shot outcome, body part, assist method, goal indicator, and event team generator. To facilitate the analyses, the codes of the categorical variables are replaced with their textual values. The mapping between codes and values is recorded in a dictionary file that can be downloaded together with the dataset from https://www.kaggle.com/secareanualin/football-events/data.

# Replace codes with their corresponding textual values
library(dplyr)  # provides filter()

for (current_category in unique(dictionary$category)) {
  # Only decode categories that exist as columns in events
  if (current_category %in% names(events)) {
    data_category <- filter(dictionary, category == current_category)
    for (code_value in data_category$code) {
      # Rows whose value is present and equals the current code
      rows <- !is.na(events[, current_category]) &
        events[, current_category] == code_value
      events[rows, current_category] <-
        data_category[data_category$code == code_value, 'description']
    }
  }
}
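The nested loop above can also be expressed as a vectorized lookup. The following is a minimal sketch on toy data, assuming the dictionary has the category, code, and description columns used above. The helper name decode_column and the toy codes are hypothetical and only serve the illustration.

```r
# Toy stand-ins for the dictionary and events tables (illustrative values only).
dictionary <- data.frame(
  category    = c("side", "side", "bodypart", "bodypart"),
  code        = c(1, 2, 1, 2),
  description = c("Home", "Away", "right foot", "left foot"),
  stringsAsFactors = FALSE
)
events <- data.frame(side = c(1, 2, 1), bodypart = c(NA, 2, 1))

# Hypothetical helper: translate one coded column through a lookup table.
decode_column <- function(values, lookup) {
  # match() yields NA for missing codes and for codes absent from the lookup
  lookup$description[match(values, lookup$code)]
}

# Decode every dictionary category that exists as a column in events.
for (current_category in intersect(unique(dictionary$category), names(events))) {
  lookup <- dictionary[dictionary$category == current_category, ]
  events[[current_category]] <- decode_column(events[[current_category]], lookup)
}
print(events)
```

One difference from the loop is worth noting: codes that do not appear in the dictionary become NA here, whereas the loop leaves them unchanged.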

Five example events are shown below, together with the features of interest.

| row | event_type | time | event_team | side | location | situation | shot_place | shot_outcome | bodypart | assist_method | is_goal |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 11 | Free kick won | 14 | Borussia Dortmund | Home | Left wing | NA | NA | NA | NA | None | 0 |
| 12 | Attempt | 14 | Borussia Dortmund | Home | Outside the box | Open play | Top right corner | Off target | right foot | Pass | 0 |
| 13 | Foul | 14 | Hamburg SV | Away | NA | NA | NA | NA | NA | None | 0 |
| 14 | Attempt | 17 | Borussia Dortmund | Home | Left side of the box | Open play | Bottom right corner | On target | left foot | Pass | 1 |
| 15 | Attempt | 19 | Borussia Dortmund | Home | Outside the box | Open play | Blocked | Blocked | right foot | None | 0 |

Univariate Plots Section

This part of the study analyzes the 11 selected features, namely: event type, time, side, location, situation, shot placement, shot outcome, body part, assist method, goal indicator, and team generator. No new variables were created for this part of the study.

Event team generator

The dataset contains information about events generated by 142 teams from Italy, Germany, France, Spain, and England. Taking a closer look at the 20 teams that generate the most events, half of them belong to Serie A (Italy), namely: Juventus, Fiorentina, AC Milan, AS Roma, Genoa, Lazio, Atalanta, Napoli, Chievo, and Inter. Of the rest, five teams are Spanish and five play in Ligue 1 in France.

Event Types

Fouls and free kicks are the two most common events in games, followed by goal attempts (see the event types figure). Together, these three types represent 74.4% of the events in the dataset. The high occurrence of fouls and free kicks suggests that games are not fluid but are often interrupted by illegal plays and the resulting direct or indirect free kicks. Even though fouls are among the most common events, games appear to be fairly played, considering the low frequency of yellow and red cards (4.4% of the total events). The figure shows that corners are not that common, which seems to indicate that goal attempts come more from open play (e.g., give and go) than from set pieces. The low occurrence of penalties (0.3% of the total events) indicates that most fouls are committed outside the penalty area.
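Shares like the ones quoted above can be derived from a frequency table of the event type column. The snippet below is a minimal sketch on toy data. The counts are invented for illustration and do not reproduce the dataset's figures.

```r
# Toy event types (illustrative counts, not the real dataset).
event_type <- c(rep("Foul", 5), rep("Free kick won", 5), rep("Attempt", 4),
                rep("Corner", 3), rep("Yellow card", 2), "Offside")

# Share of each event type in percent, largest first.
shares <- sort(prop.table(table(event_type)), decreasing = TRUE) * 100
print(round(shares, 1))

# Combined share of the three most common event types.
top3_share <- sum(head(shares, 3))
print(top3_share)
```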

Time

The figure shows a slight but consistent increase in the number of events from the first minute to the end of the game. The final minutes of each half (45th and 90th) concentrate the largest number of events.

Side

Events are roughly balanced between home teams (51.9%) and away teams (48.1%).

Location

Fewer than half of the records (49.6%) include the pitch location where the event happened. Records without a location are excluded from the figure above. Some results of the location analysis may complement the insights obtained from the event types. For example, most fouls might occur in the defensive half and probably outside the box. Further analysis is required to better understand the relationship between these variables.

Game situation

Only 24.4% of the records include the situation that triggered the event. Records without a game situation are excluded from the figure above, which shows that in most cases events start from open play. To a lesser extent, corners, set pieces, and free kicks cause events as well.

Shot place

Out of the 229,135 attempts, 227,459 (99.3%) record where the shot ends up. Records without information about the shot placement (713,550 records) are excluded from the figure above. The largest share of shots is blocked; of the remainder, a large number miss to one of the sides (left or right) or go to the center of the goal.

Shot outcome

Out of the 229,135 attempts, 228,498 (99.7%) record the shot outcome. The figure above does not include the records that do not report the shot outcome (712,511 records). The figure shows that the majority of shots are not on target. Shooters have a probability of 0.34 of hitting the target (number of shots on target divided by the total number of shots).

Body part

All events of type attempt include information about the body part involved. In the figure above we can see that the majority of attempts (53.2%) are made with the right foot.

Assistance methods

Passes are the most common assistance method, followed by crosses. It is interesting to see that crosses are not frequently used in the top leagues of Europe as a method to assist strikers. This may differ in other parts of the world, like South America, where the cross is a very common assistance method.

Was it a goal?

The vast majority of events were not goals. Out of the 229,135 attempts, 24,446 were goals (an effectiveness of 10.7%).
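The percentages reported for attempts in the preceding subsections follow directly from the counts quoted in the text. The short check below reproduces them. The helper pct is a hypothetical convenience function, not part of the original analysis.

```r
# Counts quoted in the text.
total_events <- 941009
attempts     <- 229135
with_place   <- 227459   # attempts with a recorded shot placement
with_outcome <- 228498   # attempts with a recorded shot outcome
goals        <- 24446

# Hypothetical helper: percentage rounded to one decimal place.
pct <- function(part, whole) round(100 * part / whole, 1)

place_coverage   <- pct(with_place, attempts)     # share of attempts with placement
outcome_coverage <- pct(with_outcome, attempts)   # share of attempts with outcome
effectiveness    <- pct(goals, attempts)          # goals over attempts

# Records excluded from the shot placement and shot outcome figures.
excluded_place   <- total_events - with_place
excluded_outcome <- total_events - with_outcome

print(c(place_coverage, outcome_coverage, effectiveness))
print(c(excluded_place, excluded_outcome))
```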

Univariate Analysis

Some interesting insights can already be derived from the previous exploration of the event features. Fouls, their corresponding free kicks, and goal attempts are by far the most common events in European football games. We can also see that events tend to happen in the defensive half and that they are concentrated primarily at the end of the first half and at the end of the game. In the vast majority of cases, no assistance is involved, which is expected considering that a large part of the events are fouls and free kicks. We also found that only about 10.7% of attempts end up as goals.

Bivariate Plots Section

In this part of the study, we are interested in further understanding the attempt events, especially those that end up as goals. Through the next plots, we will explore the relationship between teams and attempts, goals and assistance method, goals and time, goals and body part, goals and game situation, and goals and location.

Teams and attempts

Interestingly, 70% of the ten teams with the most attempts are from the English Premier League (i.e., Manchester City, Liverpool, Arsenal, Tottenham, Chelsea, QPR, and Southampton), suggesting that teams in England are strongly characterized by offensive play.

Attempts and assistance method

We saw previously that, in general, no assistance method is involved in most events. However, when analyzing attempt events exclusively, both those that end up as goals and those that do not, passes are the primary method of assistance. The distribution of assistance methods is quite similar for unsuccessful attempts and goals, except for the through ball, which appears to be more common in attempts that end up as goals.
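A comparison of this kind rests on the distribution of assistance methods within each group of attempts (goal versus no goal). The following is a minimal sketch on toy data, assuming the assist_method and is_goal columns shown in the example events. The values are invented and do not reproduce the dataset's proportions.

```r
# Toy attempts (illustrative values only).
attempts <- data.frame(
  assist_method = c("Pass", "Pass", "Cross", "None", "Through ball",
                    "Pass", "Cross", "None", "Pass", "Through ball"),
  is_goal       = c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1)
)

# Cross-tabulate methods against the goal indicator, then normalize
# column-wise so that each is_goal group sums to 1.
counts <- table(attempts$assist_method, attempts$is_goal)
dist   <- prop.table(counts, margin = 2)
print(round(dist, 2))
```

Normalizing within each group, rather than over all attempts, is what makes goals and non-goals comparable despite their very different sizes.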

Attempts and Time

Attempts that do not end up as goals are distributed roughly uniformly throughout games, whereas goals peak at two particular moments: the end of the first half (minute 45) and the end of the game (minute 90). Apart from these peaks, the distributions of goals and non-goals are similar. Goals at the end of the game are somewhat expected: when the score is level, both teams probably push hard to break the tie, and when one team is leading, the losing team makes its best effort to avoid defeat. However, it is interesting that something similar happens, on a smaller scale, at the end of the first half.

Goals and body part

As with attempts in general, most goals are scored with the right foot.

Goals and game situation

The vast majority of goals come from open play, which might indicate that European teams are not that effective from set pieces, free kicks, or corners. Set pieces appear to be more common in attempts that end up as goals than in attempts that do not.

Goals and location

Teams score in the majority of cases from inside the box, specifically from its center (13%), which is expected given the closeness to the goal. For every goal from outside the box, there are 2 goals scored from the center of the box. Even though goals are mainly scored from the center of the box, it is interesting to see the large proportion of attempts that are also missed from this location.

Bivariate Analysis

The previous analyses show that goals are commonly scored at the end of the first and second halves of games, and that they mostly result from passes in open play.

Multivariate Plots Section

In this section, we will explore some characteristics of the teams’ performance, such as their effectiveness and their fair play. By effectiveness, we mean the number of goals divided by the number of attempts. The fair play of a team is measured by the proportion of fouls conceded over the number of events recorded for this team.
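Both measures can be computed per team from the event table. The following is a minimal sketch on toy data, assuming the event_team, event_type, and is_goal columns shown in the example events, and assuming that a row of type "Foul" is attributed to the team that committed the foul. The team names and values are invented.

```r
# Toy events for two hypothetical teams (illustrative values only).
events <- data.frame(
  event_team = c("A", "A", "A", "A", "B", "B", "B", "B", "B"),
  event_type = c("Attempt", "Attempt", "Foul", "Corner",
                 "Attempt", "Foul", "Foul", "Attempt", "Attempt"),
  is_goal    = c(1, 0, 0, 0, 0, 0, 0, 1, 0),
  stringsAsFactors = FALSE
)

# One row per team with both ratios.
team_stats <- do.call(rbind, lapply(split(events, events$event_team), function(team) {
  n_attempts <- sum(team$event_type == "Attempt")
  n_fouls    <- sum(team$event_type == "Foul")
  data.frame(
    event_team    = team$event_team[1],
    effectiveness = sum(team$is_goal) / n_attempts,  # goals / attempts
    foul_ratio    = n_fouls / nrow(team)             # fouls / recorded events
  )
}))
print(team_stats)
```

A higher effectiveness means a team converts more of its attempts, while a lower foul_ratio corresponds to better fair play under the definition given above.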

Teams effectiveness

The red bubbles represent the ten most effective teams. Team effectiveness is indicated both by the size of the bubbles and by the ratio shown in parentheses under the team name. Barcelona appears to be the most effective team in Europe, that is, the team that best takes advantage of the goal opportunities it generates.

Characteristics of Barcelona’s goal attempts

Let’s explore the characteristics of Barcelona’s goal attempts, trying to identify relationships between field locations, game situations, and assistance methods.

| Field location | Game situation | Assistance method | Number of goals |
|---|---|---|---|
| Centre of the box | Open play | Pass | 137 |
| Centre of the box | Open play | Through ball | 45 |
| Penalty spot | Set piece | None | 44 |
| Centre of the box | Open play | None | 38 |
| Very close range | Open play | Pass | 36 |
| Outside the box | Open play | Pass | 34 |
| Centre of the box | Open play | Cross | 30 |
| Left side of the box | Open play | Pass | 20 |
| Very close range | Open play | None | 18 |
| Right side of the box | Open play | Pass | 14 |

Interestingly, Barcelona’s goals follow the general patterns. That is, as in the general case, their goals occur mainly in the center of the box and are the result of passes in open play.
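A ranking of this kind can be obtained by counting one team's goals per combination of location, situation, and assistance method. The following is a minimal sketch on toy data, assuming the column names shown in the example events. The rows are invented and do not reproduce the counts above.

```r
# Toy events (illustrative values only).
events <- data.frame(
  event_team    = c("Barcelona", "Barcelona", "Barcelona", "Barcelona",
                    "Barcelona", "Other team"),
  is_goal       = c(1, 1, 1, 1, 0, 1),
  location      = c("Centre of the box", "Centre of the box", "Penalty spot",
                    "Centre of the box", "Outside the box", "Centre of the box"),
  situation     = c("Open play", "Open play", "Set piece",
                    "Open play", "Open play", "Open play"),
  assist_method = c("Pass", "Pass", "None", "Through ball", "Pass", "Pass"),
  stringsAsFactors = FALSE
)

# Keep only the goals of the team of interest.
goals <- events[events$event_team == "Barcelona" & events$is_goal == 1, ]
goals$n_goals <- 1

# Count goals per combination and sort from most to least frequent.
combos <- aggregate(n_goals ~ location + situation + assist_method,
                    data = goals, FUN = sum)
combos <- combos[order(-combos$n_goals), ]
print(combos)
```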

Teams and fair play

The red bubbles represent the teams that conceded the fewest fouls relative to their total number of recorded events. The size of the bubbles, as well as the number in parentheses, shows the ratio between the number of fouls and the number of events recorded for the team represented by the bubble.

Multivariate Analysis

After analyzing the teams’ effectiveness (i.e., goals over attempts), Barcelona appears as the most effective team in Europe according to the data at hand. Taking a closer look at the characteristics of Barcelona’s goals, we can see that most of them come from open play in the center of the box. Passes are the most common assistance method in Barcelona’s goals. These results are not surprising considering that this team has dominated world football in recent years through its ball-possession style of play (known as tiki-taka). The low number of fouls is consistent with the tiki-taka style of play, since Barcelona keeps control of the ball most of the time during games. They win games by holding possession of the ball and keeping it away from their opponents.


Final Plots and Summary

Event Types

The plot above shows the distribution of events by type. Free kicks, fouls, and attempts represent the most common events.

Team attempts

This plot depicts the proportion of attempts over total events for the ten most offensive teams. The large majority of teams in this ranking belong to the English Premier League.

The effectiveness of teams

The plot in this section illustrates the relationship between attempts and goals, highlighting in red the most effective teams, that is, the teams with the highest ratio of goals to attempts. The ratio of goals over attempts is shown under the name of each team.


Reflection

Process and challenges

This study is based on data about football events. No data transformation was required since the dataset was already in a tidy format. The only preprocessing task performed was the replacement of the codes of the categorical variables with their values. After choosing the features to be studied, we explored them through univariate, bivariate, and multivariate analyses. Considering that most of the selected features are categorical, it was a challenge to find the correct way of plotting them in bivariate charts. We decided to use stacked bar charts to explore relationships between categorical features. When possible, we employed boxplots. Also, understanding some categories of features was not easy. We had to consult specialized sources (e.g., Football Terms, Assist Football) to learn the meaning of, for example, through ball when talking about assistance methods or set piece in the case of game situations.

Insights and opportunities for future work

We found that a large number of attempts does not necessarily mean more goals. For example, Barcelona is not the team with the most attempts (see the first plot in the previous section), yet it is the most effective. This suggests that scoring goals is not necessarily a matter of the number of attempts but the result of coordinated sequences of passes and movements, maintaining possession of the ball until the opposing defense is beaten. Long spells of possession also allow Barcelona to concede few fouls, since they control the game most of the time. Whether Barcelona is among the teams that receive the most fouls remains unexplored and is left for future work. Also as future work, it would be interesting to further understand the characteristics of Barcelona’s offensive play, for example, which combination of field location, game situation, and assistance method is the most effective in their case.